Phone-Level Prosody Modelling With GMM-Based MDN for Diverse and Controllable Speech Synthesis

Abstract

Generating natural speech with a diverse and smooth prosody pattern is a challenging task. Although random sampling from a phone-level prosody distribution has been investigated to generate different prosody patterns, the diversity of the generated speech is still very limited and far from what can be achieved by humans. This is largely due to the use of a uni-modal distribution, such as a single Gaussian, in prior works on phone-level prosody modelling. In this work, we propose a novel approach that models phone-level prosodies with a GMM-based mixture density network (MDN) and then extend it for multi-speaker TTS using speaker adaptation transforms of the Gaussian means and variances. Furthermore, we show that we can clone the prosodies of a reference speech by sampling from the Gaussian components that produce the reference prosodies. Our experiments on the LJSpeech and LibriTTS datasets show that the proposed method with GMM-based MDN not only achieves significantly better diversity than a single Gaussian in both single-speaker and multi-speaker TTS, but also provides better naturalness. The prosody cloning experiments demonstrate that the prosody similarity of the proposed method is comparable to that of the recently proposed fine-grained VAE, while the target speaker similarity is better.
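To make the contrast between a single Gaussian and a Gaussian mixture concrete, the following is a minimal sketch rather than the authors' implementation. It assumes a diagonal-covariance GMM over a two-dimensional phone-level prosody vector, with toy parameters hard-coded where a real MDN would predict them from the phone context. The function names `gmm_log_likelihood`, `sample_gmm`, and `responsible_component` are hypothetical and introduced only for illustration.

```python
import numpy as np

def _log_joint(x, weights, means, variances):
    """Per-component log(w_k) + log N(x | mu_k, diag(var_k))."""
    log_norm = -0.5 * np.sum(
        np.log(2.0 * np.pi * variances) + (x - means) ** 2 / variances, axis=1
    )
    return np.log(weights) + log_norm

def gmm_log_likelihood(x, weights, means, variances):
    """Log-likelihood of x under the mixture (log-sum-exp for stability).
    Its negative is the usual MDN training loss."""
    lj = _log_joint(x, weights, means, variances)
    m = np.max(lj)
    return m + np.log(np.sum(np.exp(lj - m)))

def sample_gmm(weights, means, variances, rng):
    """Pick a component by its weight, then sample from that Gaussian."""
    k = int(rng.choice(len(weights), p=weights))
    noise = rng.standard_normal(means.shape[1])
    return means[k] + np.sqrt(variances[k]) * noise, k

def responsible_component(x, weights, means, variances):
    """Index of the component with the highest posterior for x.
    A simplified stand-in for the cloning idea: find which component
    'produced' a reference prosody, then sample from that component."""
    return int(np.argmax(_log_joint(x, weights, means, variances)))

# Toy parameters (assumed, not taken from the paper): two modes in 2-D.
weights = np.array([0.5, 0.5])
means = np.array([[-1.0, -1.0], [1.0, 1.0]])
variances = np.full((2, 2), 0.1)

rng = np.random.default_rng(0)
sample, component = sample_gmm(weights, means, variances, rng)
reference = np.array([0.9, 1.1])  # hypothetical reference prosody vector
print(component, responsible_component(reference, weights, means, variances))
print(gmm_log_likelihood(reference, weights, means, variances))
```

With a single Gaussian, every sample clusters around one mean. With several components, repeated sampling can land in distinct modes, which is the intuition behind the diversity gain described in the abstract.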

Similar resources

Prosody Modelling for Syllable-based Speech Synthesis

The paper describes the prosody model used in the syllable-based speech synthesizer DEMOSTHENES. It focuses on the segmental structure, especially on the segmentation into rhythm units (prosodic phrases). Relations between prosodic segments and sentence constituents are also discussed.

Prosody annotation for corpus based speech synthesis

The paper concerns prosody annotation, especially for application in corpus-based speech synthesis. In order to establish rules for automatic intonation modelling, a phonetically labeled speech database of 4 hours has been perceptually and acoustically analyzed. The speech material included different text types and prosodically rich phrases. The annotation of the speech database consists in p...

Prosody Aware Word-Level Encoder Based on BLSTM-RNNs for DNN-Based Speech Synthesis

Recent studies have shown the effectiveness of word vectors in DNN-based speech synthesis. However, these word vectors, trained on large amounts of text, generally carry semantic information rather than the prosodic information that is important for speech synthesis. Therefore, if word vectors that take prosodic information into account can be obtained, it would be expected t...

Prosody control in HMM-based speech synthesis

In HMM-based speech synthesis, trained statistical models (context-dependent HMMs) are used to predict duration and generate parameters such as mel-cepstral coefficients, log F0 values, and bandpass voicing strengths using the maximum likelihood parameter generation algorithm including global variance (Toda et al., 2007). In the later stages, F0 parameters, bandpass voicing strengths, and the five ...

Prosody modelling in Czech text-to-speech synthesis

This paper describes data-driven modelling of all three basic prosodic features – fundamental frequency, intensity and segmental duration – in the Czech text-to-speech system ARTIC. The fundamental frequency is generated by a model based on concatenation of automatically acquired intonational patterns. Intensity of synthesised speech is modelled by experimentally created rules which are in conf...

Journal

Journal title: IEEE/ACM Transactions on Audio, Speech, and Language Processing

Year: 2022

ISSN: 2329-9304, 2329-9290

DOI: https://doi.org/10.1109/taslp.2021.3133205